ASR-Driven Binary Mask Estimation for Robust Automatic Speech Recognition
Abstract
Additive noise has long been an issue for robust automatic speech recognition (ASR) systems. One approach to noise robustness is the removal of noise information through segregation by binary time-frequency masks: each time-frequency unit in a spectro-temporal representation of the speech signal is labeled either noise-dominant or signal-dominant. The noise-dominant units are masked and their energy is removed from the signal. The ideal binary mask, computed given oracle information regarding the speech and noise sources, has been shown to provide significant improvements in speech intelligibility for humans. In this work, we investigate both methods of incorporating binary masks in ASR and methods for estimating the binary mask. While applying binary masks to separation tasks for humans has been straightforward, the incorporation of binary masks in ASR has proved more difficult. The field of missing data ASR proposes methods to compensate for the effects of binary masks on cepstral feature calculation. We demonstrate that, contrary to previous work, the direct use of the ideal binary mask performs at least as well as several missing data techniques when the acoustic features have been variance normalized. Typical methods for binary mask estimation focus on low-level acoustic features, and little work attempts to incorporate higher-level linguistic information in the estimation process. We propose an alternative masking criterion, called the ASR-driven binary mask, that forces the use of higher-level information. The mask is defined by force aligning the true...
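The labeling of time-frequency units described in the abstract can be illustrated with a minimal sketch. It assumes oracle access to separate speech and noise power spectrograms, a local SNR criterion of 0 dB, and a mixture whose power is the sum of the two (that is, uncorrelated sources). The function names `ideal_binary_mask` and `apply_mask`, the `lc_db` parameter, and the toy values are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-12):
    """Label each time-frequency unit: 1.0 = signal-dominant, 0.0 = noise-dominant.

    speech_power, noise_power: arrays of equal shape (frequency x time)
    holding oracle power values. lc_db is the local SNR criterion in dB
    (0 dB is an assumed default, not a value from the paper).
    """
    # Local SNR per unit; eps avoids division by zero and log of zero.
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float64)


def apply_mask(mixture_power, mask):
    """Remove the energy of noise-dominant units from the mixture."""
    return mixture_power * mask


# Toy 2 x 2 spectrogram (hypothetical values).
speech = np.array([[4.0, 0.1], [0.5, 9.0]])
noise = np.array([[1.0, 1.0], [1.0, 1.0]])

mask = ideal_binary_mask(speech, noise)
# Assumption: mixture power is the sum of speech and noise power.
masked = apply_mask(speech + noise, mask)
```

In this toy case only the units where speech power exceeds noise power survive; the other two units are zeroed. An estimated mask would replace the oracle comparison with a classifier or, as the abstract proposes, with information from the recognizer.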
Similar resources
Robust automatic speech recognition with decoder oriented ideal binary mask estimation
In this paper, we propose a joint optimal method for automatic speech recognition (ASR) and ideal binary mask (IBM) estimation in transformed into the cepstral domain through a newly derived generalized expectation maximization algorithm. First, cepstral domain missing feature marginalization is established using a linear transformation, after tying the mean and variance of non-existing cepstra...
Noise Robust Missing Data Mask Estimation Based on Automatically Learned Features
In this work, we present a missing feature reconstruction based automatic speech recognition (ASR) system in which masks are estimated by binary classification of features generated by Gaussian-Bernoulli restricted Boltzmann machines (GRBMs). The system is evaluated on Track 1 of the 2nd CHiME challenge data. Overall, the best performance is achieved when the reconstructed speech featur...
The role of binary mask patterns in automatic speech recognition in background noise.
Processing noisy signals using the ideal binary mask improves automatic speech recognition (ASR) performance. This paper presents the first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Binary masks are computed either by comparing the SNR within a time-frequency unit of a mixture signal with a local cr...
Mask estimation in non-stationary noise environments for missing feature based robust speech recognition
In missing feature based automatic speech recognition (ASR), the role of the spectro-temporal mask in providing an accurate description of the relationship between target speech and environmental noise is critical for minimizing the degradation in ASR word accuracy (WAC) as the signal-to-noise ratio (SNR) decreases. This paper demonstrates the importance of accurate characterization of instanta...
On the Role of Binary Mask Pattern in Automatic Speech Recognition
Processing noisy signals using the ideal binary mask has been shown to improve automatic speech recognition (ASR) performance. In this paper, we present the first study that investigates the role of mask patterns in ASR under varying signal-to-noise ratios (SNR), noise conditions and mask definitions. Binary masks are typically computed either by comparing the local SNR within a time-frequency u...